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Abstract 

We consider a discrete latent variable model for two-way data arrays, which allows one to simul¬ 
taneously produce clusters along one of the data dimensions (e.g. exchangeable observational units 
or features) and contiguous groups, or segments, along the other (e.g. consecutively ordered times 
or locations). The model relies on a hidden Markov structure but, given its complexity, cannot 
be estimated by full maximum likelihood. We therefore introduce composite likelihood method¬ 
ology based on considering different subsets of the data. The proposed approach is illustrated by 
simulation, and with an application to genomic data. 

Key Words: Crossed-effects models; Cross validation; EM algorithm; Finite mixture models; 
Composite likelihood; Genomics. 
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1 Introduction 


Many recent applications involve large two-way data arrays in which both rows and colnmns need to be 
gronped, possibly taking into acconnt a serial dependence in one of the two dimensions. Applications 
of this type arise in several helds. For instance, in Economics, they may concern parallel time-series 
for a certain indicator recorded on a pool of conntries. In this case, one may be interested in clustering 
countries and simultaneously grouping contiguous time periods into segments corresponding to different 
phases of the economic cycle. As another example. Genomics data sets often comprise a number of 
features measured along the nuclear DNA of a species, capturing characteristics of the DNA sequence 
and/or various types of molecular activities. In these settings, one may be interested in clustering such 
features and simultaneously partitioning the genome into segments corresponding to different molecular 
activity landscapes. 

To deal with this type of “clustering-by-segmentation” problems, we introduce a statistical model 
based on associating a discrete latent variable to every row and column of a two-way data array. The row 
latent variables are assumed to be independent and identically distributed, as they refer to entities that 
are exchangeable in nature (e.g. the countries, or the genomic features). The column latent variables, 
on the other hand, refer to serially dependent entities (e.g. time periods, or locations along the nuclear 


DNA) and are assumed to follow a hrst-order homogenous hidden Markov (HM) model (Zucchini and 


MacDonald, 2009) with initial distribution equal to the stationary distribution. Given row and column 


latent variables, the observable variables are assumed to be conditionally independent and distributed 
according to laws whose parameters depend on the values of the latent variables themselves. Our 
approach is not restricted to a specihc type of outcomes, so that a generalized linear parametrization 
as in 


McGullagh and Nelder (1989) may be used to relate observable and latent variables. 


Our two-way discrete latent variable model comprises well-known models as special cases. In partic¬ 


ular, it includes the class of crossed random effect models considered by Bellio and Varin (2005) when 
these random effects are assumed to have a discrete distribution. This class of models, however, does 
not comprise any serial dependence for the column latent variables. 

The main focus of this article is likelihood-based inference for our model. As we will show, when the 
dimensions of the two-way data array are small, maximizing the full model likelihood is computationally 
feasible and may be performed by an Expectation-Maximization (EM) algorithm (Baum et ah, 1970| 
Dempster et ah, 1977) that extends the one used for HM models of longitudinal data (Bartolucci et al.| 


2013, 2014) - also named latent Markov models. However, it is easy to convince oneself that full model 


likelihood maximization is computationally unaffordable for moderate or large size arrays. In fact, 
even employing the efficient and well-known HM forward recursion by Baum and Welch (Baum et al.| 


1970 Welch, 2003), the numerical complexity of computing the full likelihood function for an array of 


2 














































dimensions r x s has order 0(s/c[/c|) - where ki and k 2 are the number of support points of row and 
column latent variables, respectively. This complexity increases exponentially with r and linearly with 
s, due the use of the aforementioned HM recursion. Notably then, data arrays with a large number 
of exchangeable rows are more problematic to deal with than those with a large number of serially 
dependent columns. 

In order to deal with the estimation problem described above, we propose a composite likelihood 


approach (Lindsay, 1988 Cox and Reid, 2004); see (Varin et ah, 2011) for a review. In particular, 
we introduce two versions of composite likelihood. The hrst, which we name row composite likelihood, 
results from ignoring dependencies between data rows due to sharing the same column latent variables. 
This composite likelihood can be maximized by an EM algorithm similar to the one used for mixed 
HM models of longitudinal data with discrete mixing distributions (Maruotti, 2011[ ). The second and 
more satisfactory version, which we name row-column composite likelihood, results from combining the 
row composite likelihood with an analogous construct for the columns; i.e. a composite likelihood in 
which one ignores dependencies between data columns due to sharing row latent variables. As we 
will show, also the row-column composite likelihood can be maximized by an EM algorithm that is 
computationally viable even for large data arrays. Our algorithms are implemented in R and available 
upon request. 

We study the finite sample properties of row and row-column composite likelihood estimators by 
simulation. Importantly, our simulation study covers also two-way arrays with small dimensions - where 
these estimators can in fact be compared with the full likelihood estimator. This gives us a chance to 
quantify the loss of efficiency due to the use of composite likelihood approximations, and to identify the 
parameters with respect to which this loss is more sizable. 

Another relevant aspect we tackle is model selection; in particular, the choice of ki and k 2 - the 
number of support points for row and column latent variables. The strategy we suggest is based on 


cross validation - extending the approach of Smyth (2000) for finite mixture models and of Celeux and 


Durand (2008) for HM models. For our model, implementing cross validation-based model selection is 


complicated by the lack of independence between any pair of observations in the data. To deal with this, 
we devise a cross validation scheme where (a) half of the “cells” in the two-way data array, identified 
randomly drawing row and column indexes, are withdrawn for use as test set, and (b) a missing-at- 
random version of the composite likelihood is used both for estimating parameters on the training set 
and for measuring fit on the test set. 

The use of our model, inference approach and model selection strategy are illustrated through an 
application in Genomics. Several recent studies (Oldmeadow et ah, 2010 Ernst et al.| 2011; ENCODE 


Consortium, 2012; Hoffman et ah, 2013) have utilized HM models to create segmentations of the human 
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genome leveraging data from inter- or intra-species comparisons, or various types of high-throughput 


genomic assays. In particular, Kuruppumullage Don et ah (2013) produced a segmentation based on the 
rates of four types of mutations estimated from primate comparisons in 1Mb (megabase) non-overlapping 
windows along the human genome. The authors also gathered and pre-processed publicly available 
data on several dozens genomic features in the same windows system. These features capture, among 
other things, aspects of DNA composition, prevalence of transposable elements, recombination rates, 
chromatin structure, methylation, transcription, etc. Producing a segmentation based on this large 
array of features could provide signihcant biological insights, all the more if one could simultaneously 
characterize their interdependencies by partitioning them into meaningful groups. The application we 
present in this article is a feasibility proof for such an endeavor; we utilize our model and methodology 
to perform “clustering-by-segmentation” on a two-way data array comprising r = 28 genomic features 
measured in s = 224 contiguous 1Mb windows along human chromosome 1. 

The reminder of the article is organized as follows. In Section we introduce the structure and 
assumptions underlying our statistical model. In Section we outline methodology for full likelihood 
estimation of the model parameters, and in Section we outline row and row-column composite like¬ 
lihood methodology. In Section we describe our simulation study, in Section we discuss model 
selection with cross validation, and in Section we present results of our application to genomic data. 
Finally, we offer some concluding remarks in Section 


2 The Model 

Consider a two-way array of random variables Yij, i = 1,... ,r, j = 1,..., s, where r is the number of 
rows and s is the number of columns. The basic assumption of our model is that these observable vari¬ 
ables are conditionally independent given two vectors Ui,... ,Ur and Di,..., 14 of row and column latent 
variables. The row latent variables (f/’s) are assumed to be independent and identically distributed 
according to a discrete distribution with u = 1,..., ki support points and mass probabilities 


\u=p{Ui = u), u = l,...,ki. 


The column latent variables (D’s) are assumed to follow a first order Markov chain with v = 1,... ,k 2 
states, initial probabilities 

Tlv=p{Vi=v), V = l,...,k2, 


transition probabilities 


T^vv = P{Vj = v\Vj_i =V), V,V = 1,.. . ,/C2, 
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and stationary probabilities 

= lim piVj = n), n = 1,..., ^ 2 . 

S^OO 

We also postnlate that initial and stationary distribntions coincide; that is 

Tl^ = pv, V = l,...,k2. (1) 

This makes the model more parsimonious as the chain can be directly parametrized by the transition 
probabilities. 

Our model specification is completed by formulating the conditional distribution of every observable 
variable Yij given the underlying pair of latent variables {Ui,Vj). In the continuous case, a natural 
assumption is that 


Yij\Ui = u,Vj = v N(V’„^,cr^), u = l,...,ki, n = 1,..., fca, 


( 2 ) 


where the 'ipuv are means depending on the latent variables and is a common variance. This results 


in a complex finite mixture of Normal distributions (Lindsay, 1995; McLachlan and Peel, 2000). Note 


that the requirement that the Normal mixture be homoschedastic is quite common in the finite mixture 
literature as it avoids degenerate solutions in terms of maximum likelihood estimates. Moreover in 
many practical applications (see for instance Section the data can be preprocessed and transformed 
as to make a homoschedastic Normal mixture suitable. 

It is important to remark that our model can be made more parsimonious incorporating knowledge 
in the form of constraints imposed on the means ipuv For instance, we could postulate that 

'tpuv = M = 1, . . . , fci, n = 1, . . . , ^ 2 . 

On the other hand, our model can be made more general allowing each Yij to depend also on observable 
covariates. For instance, we could postulate that 


Y{Yij\Ui = u, Vj = v) = i!uv + M = 1 ,..., fci, n = 1 ,..., /C 2 , 


(3) 


where the vector Xij comprises the covariates (which are assumed to be fixed and known; not random) 
and the vector (3 the corresponding regression coefficients (which are assumed not to depend on the 
latent variables). 


Along the same lines adopted in Bellio and Varin (2005), another obvious way to generalize our 


model is to replace the Normal specification ([^ for the distribution of Yij given (f/j, Vj) with any 
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exponential family distribntion nsed in Generalized Linear Models (GLMs, 
1989). For instance, for a two-way array of binary variables we may assnme 


McGnllagh and Nelder 


Yij\Ui = u,Vj = V Bernonlli(p„^), u = 1,... ,ki, v = 1,... ,k 2 , 


where the snccess probabilities puv depend on the latent variables. Then, as in a GLM, these probabilities 
conld be expressed throngh an additive parametrization and/or as a fnnction of observable covariates. 
For instance, depending on the application, it may be reasonable to postnlate that 


log 


= l\U, = u,Vj = v) 
E{Yij = 0\Ui = u,Vj = v) 


= M = 1,..., /ci, n = 1,..., 


nsing a logit link fnnction - which directly compares with (|^. 

In the following, we first introdnce fnll likelihood maximization as a means to estimate the param¬ 
eters of onr model. As this type of estimation is computationally feasible only with small arrays, we 
then switch to composite likelihood methodology. We remark that while the methods are described in 
reference to the homoschedastic Normal mixture in ([^ and under the constraint that the initial and sta¬ 
tionary distributions of the Markov chain coincide as in Q, the implementation can be streighforwardly 
generalized to deal with different parameterizations and/or to account for covariates. 


3 Full Likelihood Methodology 

Let Y denote the matrix of all the outcomes i = 1,..., r, j = 1,..., s. Also let u = (ui,..., Ur)' 
and V = {yi,... ^Vs)' denote possible conhgurations of the row and column latent variables, respectively. 
The joint density function will then be 

p{Y) = EE ■ ■ ■ ^UrPvi'^VlV2 ■ ■ ■ nn 

U V ^ J 

where </>(?/; -ip, a^) denotes the density function of a N(ip, a^), and the sums are extended to all possible 
row and column latent variables configurations u and v. Expressed this way, p(Y) can be computed 
only in trivial cases because it involves a sum over k^k^ terms. However, if the number of rows r is 
relatively small, an effective strategy is to rewrite the joint density function as 

p{Y) = ^p{Y\u)p{u), 
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where 

p{Y\u) = ^py^p{yf'>\u,vi)^ 7 r^,^^p{yf'>\u,V 2 ) ■ ■ (4) 

Vi V2 Vs 

and y^p = (j/ij,..., i/rj)' corresponds to a single column of outcomes, so that 


p{yf\u,v) = Yl(j){yij]i^uiv,(^^)- 


The density function in Q can be computed with a well-known recursion in the HM literature (Baum 
et al., 1970; Welch, 2003[ ), the numerical complexity of which increases linearly in s. Thus, the number 
of operations to compute p{Y) becomes of order sklk 2 , as already indicated in Section Belatedly, 
the log-likelihood 

e{e) = \ogp{Y), 

where 6 is short-hand notation for all model parameters, can be maximized using an EM algorithm 


(Baum et ah, 1970 Dempster et ah, 1977) which is described in detail in the following. 


3.1 EM algorithm for full likelihood estimation 

First, we introduce the complete data log-likelihood corresponding to (-{6). Consider the latent indicators 
Wiu and Zjy - for row and column latent variables, respectively. In particular, Wiu is equal to 1 if t/j = u 
and to 0 otherwise, with i = 1, .. ., r, u = 1, ..., fci, and Zj^ is similarly defined with reference to Vj. 
With some simple algebra, the complete data log-likelihood can be written as 

t{e) = a{\) + b{U) + c{^,a^), (5) 

where 

“W = EE Wiu log(A„), 

i u 

i-(n) = ^ Ziu log(p^) -I- EEE Zjuy log(7r 

V j>l V V 

c(*,o = eeee WiuZju log '^uvi ^ ); 

i j u V 

with Zjyu = ZjyZjy. lu tho above decomposition, the vector A comprises the row latent variables’ mass 
probabilities A„, the matrix 11 comprises the column latent variables’ transition probabilities tt^^, and 
the matrix comprises the means 'ipuv 
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The EM algorithm alternates two steps until convergence: 


E-step: compute the posterior expected value of each indicator variable in ([^. For i = 1,..., 
and M = 1,..., fci we set 


n 


Wiu = piUi = u\Y) = 


p{Y) 


p(Y\u)p{u), 


u:ui=u 


where the sum Ylu u =u extended to all conhgurations u with ith element equal to u. For 
j = 1,..., s and h, n = 1,..., ^2 we set 

ziv = p(Vi = n|l^) = = n|u,r)p(l^|w)p(w), 


Zjyv 


= = v\u,Y)p(Y\u)p{u), 


where the conditional probabilities p{Vj = v\u,Y) and p{Vj-i = v,Vj = v\u,Y) are obtained 


from suitable recursions (Baum et ah, 1970 Welch, 2003). Finally, for i = 1,...,n, j = 1,..., s, 


M = 1,..., fci, and n, n = 1,..., /c 2 we set 

{wiuZj^) = p{Ui = u, Vj = n|W) = Y Pi^J = Y)p(Y\u)p{u). 

' u:ui=u 

M-step: update the value of each parameter in (|^. For m = 1,..., fci we update the row mass 
probabilities as 

^ ^ Will- 

i 

Under constraint ([^, we update the transition probabilities by numerical maximization of the 
function 

Kn) = ^ Zly l0g(p„) + EEE Zjvv log(7r 

vv\ 

V j>l V V 


see also Bulla and Berzel (2008) and Zucchini and MacDonald (2009). Finally, for u = 1,... ,k 


and n = 1,..., ^2 we update means and common variance for the Normal distributions as 

1 


f^uv — 


E* E,- (wiuz. 


^ ^ ^ ^ {WiuZjv)yij 


jvj i j 




















and 


i j u V 

This algorithm runs and converges in a reasonable time if the number of rows in the two-way data array 
is r < 10 and the row latent variables are binary {ki = 2), even with a large number s of columns. 
Just to give an idea, using our R implementation on a standard personal computer, a few seconds are 
necessary to estimate the model with r = 5 rows, s = 200 columns, and binary latent variables. Again 
with binary latent variables and the same value of s, but with r = 10, the computing time increases to 
a few minutes. However, as r increases and, in particular, as the number of support points of the row 
latent variables increases, full maximum likelihood estimation becomes prohibitive and is infeasible for 
the models considered in our application (see Section]^. 


4 Composite Likelihood Methodology 


Given that the EM algorithm for full likelihood estimation is not computationally viable in typical 
applications, we propose an alternative approach based on maximizing a composite likelihood function 


where the rows are treated separately (Lindsay, 1988; Cox and Reid, 2004). In this section we introduce 


two versions of the composite likelihood function; the row composite likelihood, which is related to the 


method proposed by (Bartolucci and Lupparelli, 2015) for multilevel HM models, and the row-column 


composite likelihood. The latter is characterized by greater complexity and potentially larger estimation 
efficiency. 


4.1 Row composite likelihood estimation 

First, we consider the density function of the fth row of the data, represented as a column vector 
= {yn,... ,yisy. Given the underlying latent variable Ui, this is generated along a stationary 
hidden Markov model, so that 

PiVi cr ) ) ' ' ' ^^ ’^Vs-lVe^^iVis^ ^ )• 

V-i V2 Vs 

In practice, = u) is computed by a simplified version of the recursion used for the full likelihood 

estimation. The next step is to integrate out the latent variable Ui as to obtain 

p{yf^) = '^Kp{yf\Ui = u). 


9 












The row composite log-likelihood is defined based on this density fnnction as 


ch{0) = p{y\ 


(1)' 


( 6 ) 


Importantly, this can be readily compnted also for a large nnmber of rows, as it treats the rows as 
independent. 

In order to implement an EM algorithm to maximize ci{6), it is nsefnl to note that ([^ is the log- 
likelihood of a model that, in addition to satisfying all assnmptions in Section]^ postnlates independent 
Markov chains Va,... ,Vis nnderlying each row of data i = 1,..., r. This additional assnmption 
implies a different definition of the complete data likelihood. We now need to consider the indicator 
variables and The former have the same meaning as the Wiu introdnced in Section however 
the latter are now defined separately for each row - reflecting the strnctnre of the target fnnction in 
(5); we let z^jl eqnal to 1 if Vij = v and to 0 otherwise. Using these indicator variables, we express the 
complete data composite log-likelihood as 


cil{0) = cm (A) + chin) + cci(^', a'). 


(7) 


where 


cai(A) = log(A„), 

i 

- E 


c6,(n) 

cci(^',a^) 


5^ 4^] log(p^) + ^ 44 

V j^l V V 

54 54 54 54 4444 huv, 


I J U V 


p ^(1) ^ ^(1) ^(1) 

ijvv i,j—l,vvijv' 

The EM alternates two steps nntil convergence: 


E-step; compnte the posterior expected valne of each indicator variable in Q. Note the defini¬ 
tions in terms of posterior probabilities here hold nnder the “approximating” model in which the 
data rows are independent. For i = 1,... ,r and n = 1,..., fci we set 


44 = PiUi = u\yl'’) = 


(1)^ 


= u)Xv 


p{y 


(1)^ 


( 8 ) 


10 







Thus, for i = 1,..., r, j = 2,..., s, and u, u = 1,..., /c 2 we set 

= v\yf'^) = = v\Ui = (9) 

U 

4jlv = P{Vij-i = V, Vij = v\y[^'^) = = V, Vij = v\Ui = u, V£\ (10) 

U 

where the conditional probabilities p{Vij = v\Ui = u^yf^) and piVij = n, Vij_i = h|17j = u^y^p) 
are obtained by suitable recursions similar to the ones used in the E-step for the full likelihood. 
Finally, for = 1,..., r, j = 1,..., s, n = 1,..., /ci and n = 1,..., ^2 we set 

= PiUi = M, Vij = v\yf'^) = p{Vij =v\Ui = u, (11) 


M-step: update the value of each parameter in ([^. For u = 1,..., fci we update the row mass 
probabilities as 


Under constraint ([^, we update the transition probabilities by numerical maximization of the 
function 


MU) 


E 


^^sipv) + J2J2Y1 iog(7r^),;) 

_ U V j>l V V 


Finally, for u = 1,..., fci and n = 1,..., ^2 we update means and common variance for the Normal 
distributions as _ 

l^uv — --—^ 

E,E,K'U1U 

and _ 

- Puvf- 

i j u V 


4.2 Row-column composite likelihood estimation 

We now pass to consider a more complex composite likelihood, which takes into account also the density 

(2^\ 

function of each separate column of the data. For the jth data column represented by y) , given the 
underlying latent variable Vj, we have 

PiyflVj = v) = YlpiVijlVj = n). 
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where p{yij\Vj = v) = -0^^,(T^)A„. Thus, integrating out the latent variable Vj we obtain 

p(yf) = '^pivflVj = v)p^. 

V 

The composite log-likelihood based on this density function is 

ci2{e) = ^\ogp{yf). ( 12 ) 

3 

To estimate the parameters of our model, we propose to maximize the row-column composite log- 
likelihood defined as the sum of the row composite log-likelihood in ([^ with the above expression: 

die) = ciiio) + chio) 


In this regard, we note that (12) is the log-likelihood of a model which, in addition to satisfying all 
assumptions in Section postulates that each column of the data j = 1,..., s, depends on an 
independent sequence of latent variables Uij,... Urj also independent of each other. Moreover, this 
model assumes that the latent variables Vj, j = 1,..., s are independent and distributed according to 
the stationary distribution. Consequently, we now use the indicator variables and z)J. The latter 


-JV 


have the same meaning as the Zj^ introduced in Section 3.1, however the former are defined separately 

(2) 

for each column; we set = 1 if Uij = u and 0 otherwise. Using these indicator variables, we express 
the complete data composite log-likelihood as 


ciliO) = ca2(A) -F 062(11) -t- CC2(^', cr^). 


where 


002 

i j u 

o 62 (n) = '^'^zf^\ogip„), 

j V 


002(^',a^) 


log 0(l/p; (T^)- 


i j u V 

The EM alternates two steps until convergence: 

• E-step: We compute the same posterior probabilities as in 


, ([^, and ( p^ . 


In addition, for 
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j = 1,..., s and n = 1,..., A ;2 we set 


zfj = p{Vj = v\yy’) = 


(2)^ 


PivTlVj = n)p^, 


p{yf] 


Thus, for i = 1,..., r and j = 1,..., s, n = 1,..., /ci we set 

UV 5 ^(2) 


= PiUij = u\yf) = 


-Za. 


p{yij\yj = v) 

and for i = 1,..., r, j = 1,..., s, M = 1,..., /ci and n = 1,..., /c 2 we set 


= PiUij = U, Vj = v\yY^) 


(2)\ ^ )A{i ^(2) 


PiVijlVj = v) 


jy 


M-step; For u = 1,... ,ki we update the row mass probabilities as 


All 


Edf + EEd; 


( 2 ) 


* J 


r + rs 

We update the transition probabilities by numerical maximization of the function 


(13) 


cKn) = c()i(n) + c()2(n), 


where 

c 62 (n) = zfJ log(p„). 

j V 

Finally, for u = 1,... ,ki and n = 1,..., A ;2 we update the means and common variance of the 
Normal distributions as _ _ 

_ Ei Ej [iwlu4il) + {wiu4fv)]yij 

l^uv — --—-- --—-- 

Ei Ei {w4^4jl) + 

and 

^ + {'^144fv)]iyij - Puvf- 

i j u V 
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5 Simulation study 


We performed a simulation study to assess and compare the performance of our two approximations - 
the row and the row-column composite likelihoods - to one another and to full likelihood estimation. 


5.1 Simulation design 


We consider a benchmark design in which the two-way data array has dimensions r = 10 by s = 200, 
with two support points for both row and column latent variables {ki = /c 2 = 2). This design has 
r << s, as is perhaps typical in many applications, and is small enough for full likelihood estimation 
to be viable. We £x the model parameters as follows: 


A = (0.5,0.5)'; 

„ / 0 . 88 O 8 0.1192, 

n = I , so that p = (0.5, 0.5) ; 

\ 0.1192 0.8808' 




1 2 
3 4, 


= 0.5. 


In order to assess the behavior of the estimators under comparison, we also consider other scenarios in 
which specific elements of the benchmark design are suitably modified. In particular, we consider the 
following scenarios: 

• r = 15 instead of r = 10 and parameters fixed as above; 


• s = 400 instead of s = 200 and parameters fixed as above; 

in these scenarios there is a larger amount of information on the structure underlying, respectively, the 
serially dependent columns or the exchangeable rows. 

• ki = 3 instead of /ci = 2 and parameters fixed as above apart from A = (1/3,1/3,1/3)'; 


• k 2 = 3 instead of A ;2 = 2 and parameters fixed as above apart from 11 


/0.7870 0.1065 0.1065\ 
0.1065 0.7870 0.1065 
\0.1065 0.1065 0.7870/ 


and thus p = (1/3,1/3,1/3)'; 

in these scenarios there is a larger complexity of, respectively, the row or column latent structure. 

• = 1 instead of = 0.5 and parameters fixed as above; 

in this scenario there is a smaller separation between latent states. 
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5.2 Simulation results 


Each scenario is simulated 1,000 times independently, and bias and square root of the mean squared 
error (RMSE) for parameter estimation are computed for each estimation method - i.e. full likelihood, 
row composite likelihood, and row-column composite likelihood. Results for are reported in Table 
those for Tiyy in Table those for tpuv in Table and those for in Table In Table we also report 
median computing times in seconds, along with median absolute deviations - which are an important 
elements in comparing estimation methods. 

[Table 1 about here.] 

[Table 2 about here.] 

[Table 3 about here.] 

[Table 4 about here.] 

[Table 5 about here.] 

We see that the (row) mass probabilities Xu are well estimated by both approximations, with accuracy 
comparable to full likelihood estimation - suggesting that, in scenarios with r « s, there is enough 
information available on each row latent variable for either (and both) of the proposed composite 
likelihood approximations to accurately capture such probabilities. 

In contrast, the (column) transition probabilities are estimated with comparable accuracy by 
the two approximations, but this accuracy is lower than that afforded by full likelihood estimation - 
likely reflecting the fact that, even in the more sophisticated row-column approximation, ci 2 { 0 ) relies 
on independent data columns. 

Finally, the means 'ipuv are estimated with higher accuracy by the row-column approximation than 
by the row approximation - reflecting the fact that the former comes closer to the full likelihood. Similar 
comments apply to the estimation of 

Concerning computing time, our composite likelihood approximations are about 5-fold faster than 
the full likelihood in the benchmark design. Perhaps most importantly, when we pass to scenarios 
where r = 15 (instead of 10) or /ci = 3 (instead of 2) time increases by two orders of magnitude for 
the full likelihood. We also note that, while the median running times for the full likelihood appear 
still relatively modest (approximately 351 seconds), in some of the simulations with fci = 3 they were 
as high as 9-10 hours - notwithstanding the fact that size and complexity of the simulated data here 
are still much smaller than those one can expect in real applications (for an application of the size and 
complexity of the one in Section 7, running times for the full likelihood could be measured in months). 
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This effect of row size and structural complexity is not seen for either of our approximations. Their 
average computing times remain fairly similar across scenarios, and appear appreciably higher only 
when ^2 = 3 (instead of 2). 

In general, average times for row and row-column approximations are also similar to each other. In 
fact, in some cases (e.g. the one with ^2 = 3) the row-column approximation appears to be faster than 
the row approximation; this is due to the fact that the EM algorithm converges in a smaller number of 
iterations, even though each iteration is more time consuming by construction. 

In summary, our simulations show the row-column composite likelihood approximation to be the 
right compromise between accuracy and computational viability; it is closer to the accuracy of the full 
likelihood estimation than the row approximation, especially for estimating means and variance, but 
much cheaper than the full likelihood for large/complex data arrays - and not more expensive than the 
row approximation. 


6 Model selection 


A critical point for the model and the composite likelihood approach we propose to be useful in appli¬ 
cations, is selecting the number of support points for row {ki) and column (/C 2 ) latent variables. When 
full likelihood methods are used to estimate simpler models, the literature on model selection suggests 


information criteria such as the Akaike Information Criterion (AIC; Akaike, 1973) and the Bayesian 
Information Criterion (BIC; Schwarz, 1978) (see McLachlan and Peel ( |2000 ), Chapter 6, for a general 
discussion on selecting the number of components in finite mixture models). These criteria penalize the 
maximum log-likelihood of the model of interest with a term based on the number of free parameters, 
seen as a measure of model complexity. 

Adaptations of both the AIC and the BIC in which the maximum of the full log-likelihood is replaced 


with that of a composite log-likelihood are proposed by Varin and Vidoni (2005) and Gao and Song 


(2011). In these cases, computing the penalization term is more complicated - as it requires the Hessian 


of the composite log-likelihood function and estimation of the variance of its score; see also Bartolucci 


and Lupparelli (2015) 


Given the complexity of the model we introduced, we prefer to rely on a cross validation strategy 
similar to that in Smyth (2000) and Geleux and Durand (2008) - which avoids the matrices involved 


in the modified AIG and BIG altogether. In this regard, we note that, since we are not dealing with 
independent and identically distributed data, estimation of the composite log-likelihood score is rather 
complicated. On the other hand, cross validation can be implemented straightforwardly, requiring only 
a small amount of extra code with respect to that already developed for estimation, and a reasonable 
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computing time. 

The cross validation strategy we propose, after splitting the data into a training and a validation 
sample, treats the missing cells in either sample as “missing completely at random”. In more detail, for 
selecting ki and /c 2 we proceed as follows: 

• Split the data into a training sample Sd and a validation sample Sd by randomly drawing one 
half of the cells in the observed two-way array, and repeat this d = 1,... ,D times (e.g. D = 100 
is used in our application below). 

• For each d = 1,..., D and each pair {ki, /C 2 ) of interest, estimate the parameters in 6 based on Sd 

by maximizing under the assumption that the cells removed for validation are data 

missing completely at random. Let 6kik2{Sd) indicate the resulting estimate. 

• For each pair (/ci, k 2 ) of interest, compute 

c(!cv,kik2 

^cv,kik2 

where !{■} is the indicator function equal to 1 if its argument is true and to 0 otherwise. The hrst 
quantity is the average composite log-likelihood computed on the validation samples - considering 
for each the parameter estimates based on the corresponding training sample. The second quantity 
is the number of validation samples (out of D) for which the model with ki and k 2 support points 
reaches the highest value of the composite log-likelihood. 

As we illustrate in the application section below, these quantities provide guidance in choosing {ki, ^ 2 ); 
we would like a pair that either maximizes or reaches a value close to the maximum in terms of both 
cicv,kik2 ncv,kik2- Of course other derived quantities, as well as parsimony considerations, can and 
should be employed also (see below). 

7 A first application to genomic data 

As a hrst illustration of how our model and methodology can be used on large, complex data sets, 
we consider an application to Genomics. The data comes from a study by Kuruppumullage Don et 
al. (2013) and has been kindly provided by K.D. Makova and her group at the Pennsylvania State 
University. The authors used standard HM methodology to segment the human genome based on 


^ ^ c(^kik2{^kik2{'Sd)\Sd), 


^ ^ f > ('^kik2 (^fcifc2 l^^d) (^/ilfc2 (‘^d) f) 

~—' 1 hiM I 
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the rates of four types of mutations estimated from primate comparisons in contiguous 1Mb non¬ 
overlapping windows. To try and relate the resulting “mutational states” to the landscape of DNA 
composition and molecular activity along the genome, the authors also gathered and pre-processed 
publicly available data on several dozens genomic features in the same windows system. Here, we address 
the question of whether it is possible to produce another segmentation, based not on four mutation rates 
but on this large array of features - while simultaneously characterizing their interdependencies through 
clustering. As a feasibility proof, we thus utilize our model and methodology to perform “clustering- 
by-segmentation” on a two-way data array comprising r = 28 features measured in s = 224 contiguous 
1Mb non-overlapping windows covering human chromosome 1. 

The features, listed in Table capture aspects of DNA composition (e.g. GC content), preva¬ 
lence of transposable elements (e.g. number of LINE elements; SINE elements; DNA transposons - as 
well as their subfamilies), recombination (male and female recombination rates), chromatin structure 
(e.g. number of nuclear lamina associated regions; miRNAs; H3K4mel sites and H3K14 acetylation sites; 
Polymerase II binding sites; DNase 1 hypersensitive sites), methylation (e.g. number of non-CpG methy¬ 
lated cytosines; 5-hydroxymethylcitosines; average DNA methylation level), transcription (e.g. number 
of GpG islands; coverage by coding exons) and more. The features were standardized through normal 
scores prior to use with our approach. A representation of the data after standardization is provided in 
Figure 


[Table 6 about here.] 
[Figure 1 about here.] 


7.1 Model selection 

The hrst critical task is to select the number of support points for the row and column latent variables 
distributions {ki and ^ 2 ); that is, the number of groups in which to cluster the r = 28 genomic features 
under consideration, and the number of distinct states in which to segment the s = 224 windows covering 
chromosome 1. To perform this selection, we relied on the cross validation strategy described in Section 6 
with D = 100 iterations; Table reports the average row-column composite log-likelihood computed 
on the validation samples, using the estimates computed on the corresponding training samples; this is 
denoted by cicv,kik 2 - Table reports the number of times a certain model (i.e. combination of ki and 
^ 2 ) beat all other models in terms of composite log-likelihood on the validation samples, denoted by 
ncv,kik 2 - ^Iso report the number of free parameters for each model in Table 

[Table 7 about here.] 
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[Table 8 about here.] 
[Table 9 about here.] 


According to these results, the model achieving highest average composite log-likelihood, is the one 
with fci = 3 and k 2 = 12. This model does also well by beating all other models 5 times (out oi D = 100) 
- the maximum here is 6 , which is obtained for ki = 3, k 2 = 11 and ki = 5 and k 2 = 13. However, 
from both Table and Table we can see that several alternative {ki,k 2 ) pairs provide very similar 
performance. In addition, from Table we can see that the model with ki = 3 and k 2 = 12 has a 
very large number of free parameters compared to other models with similar performance. To provide 
an alternative quantihcation, for each model we compute an index of relative performance. In more 
detail, for every given combination of ki and k 2 we consider the average composite log-likelihood across 
cross-validation iterations, subtract the minimum of such quantity over all combinations considered, 
and divide by the difference between its maximum and minimum: 


Qkik2 


^^cv,ki k2 ^^cv,hih2 

max/jj/j2 miu/jj^/jj c£cv,hih2 


the higher this index, the better the model identihed by ki and ^ 2 - 


Table 10 reports the index values. 


[Table 10 about here.] 


The relative performance index points towards the model with ki = 3 and ^2 = 4. This model 
achieves ^ 3^4 = 0.902 (i.e. a loss of predictive power of only 10% relative to the model with ki = 3 and 
^2 = 12) while requiring only 27 free parameters (compared to 171 for ki = 3 and k 2 = 12). In fact, the 
model with ki = 3 and /c 2 = 4 is the smallest with a relative performance above 0.9. Based on cross 
validation performance and parsimony, we therefore take this as our selected model. 


7.2 Estimation results 


Next, we discuss parameter estimates for our selected model; recall that we are forming three clusters 
of genomic features (fci = 3), and segmenting chromosome 1 according to four distinct states (^2 = 4). 

reports estimates of the mass probabilities of the row latent variable distribution (A^j) and 


Table 
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estimates of the means {'il’uv)- As a convention, modalities of the row latent variable {u = 1,2,3) are 
ordered by decreasing A„ and modalities of the column latent variable (n = 1,2, 3,4) are ordered by 
increasing 


[Table 11 about here.] 
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Table 12 reports estimates of the transition probabilities and estimates of the stationary 

distribntion (p^) for the Markov process governing the colnmn latent variable. 


[Table 12 abont here.] 

Fignrej^shows a color-coded map of the predictions associated with the selected model. For each cell 
(i, j), i = 1,..., r (r = 28), j = 1,s {s = 224) of the two-way array, we (i) predict the featnre clnster 
(i.e. the row latent state) iii and the segmentation state (i.e. colnmn latent state) Vj on the basis of the 
maximum a posteriori probability (MAP), and (ii) set the cell’s predicted valne to the estimated mean 
The horizontal dimension represents the s = 224 contignons windows along chromosome 1, with 
the horizontal bar on top reporting h^’s color-coded on a green-to-blne range. The vertical dimension 
represents the r = 28 genomic featnres, with the vertical bar on the right reporting hj’s color-coded on 
a black-to-red range. Rows are rearranged gronping featnres according to the three clnsters. The inner 
part of the hgnre reports the color-coded on a green-to-red range as was done for the data in 

Fignre[^ One can therefore interpret patterns in the way low (green) and high (red) predicted valnes 
characterize different genomic featnre clnsters (as marked on the vertical bar to the right) and segments 
on the chromosome (as marked on the horizontal bar on top). 


[Fignre 2 abont here.] 


Concerning the three clnsters of genomic featnres, we note that Cluster 1 is very large, comprising 
20 features (the estimated mass probability is approximately 80%), while Clusters 2 and 3 are much 
smaller, with 3 and 5 features respectively (the estimated mass probabilities are each approximately 
10%). In more detail. Cluster 2 includes number of telomerase containing examers (a proxy for repair), 
DNA transposons (a proxy for transposition activity) and histone H3K14 acetylation sites (a proxy for 
chromatin structure). Cluster 3 includes number of non-CpG methyl-cytosines (a proxy for methyla- 
tion), nuclear lamina regions and polymerase II binding sites (proxies for chromatin structure), ALU 
elements and MER elements (proxies for transposition activity). 

Concerning the four segmentation states, we note that they cover approximately 10%, 30%, 35% and 
25% of chromosome 1, respectively (from the estimated stationary distribution). From the estimated 
means, we note that State 1, i.e. the least prevalent, is characterized by strongly depressed Cluster 1 
features, depressed Cluster 2 features and strongly elevated Cluster 3 features. State 3, i.e. the most 
prevalent, has mildly elevated levels for features in all clusters. State 2 and State 4, whose prevalences 
are more similar, have “mirroring” prohles - the former is characterized by depressed Cluster 1 features, 
strongly elevated Cluster 2 features and strongly depressed Cluster 3 features, the latter by strongly 
elevated Cluster 1 features, strongly depressed Cluster 2 features and elevated Cluster 3 features. 
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Interestingly, from Figure we can see that while all four states are represented and alternate 
along most of the chromosome, its “beginning” (approximately the hrst 50 windows towards the left 
of the hgure) shows a marked prevalence of State 4. Also interestingly. Cluster 2 shows strongly 
elevated levels in State 2 (covering approximately 30% of the chromosome), where all other features are 
depressed or strongly depressed, and strongly depressed levels in State 4 (covering approximately 25% 
of the chromosome with a prevalence in the hrst 50 windows), where all other features are elevated or 
strongly elevated. On its end. Cluster 3 shows strongly elevated levels in State 1 (covering approximately 
10 % of the chromosome), where all other features are depressed or strongly depressed. 

8 Conclusions 

In this article, we considered a discrete latent variable model for two-way data arrays, which allows 
one to simultaneously produce clusters along one of the data dimensions and contiguous groups, or 
segments, along the other. We proposed two composite likelihood approximations and their EM-based 
optimization for estimation, as well as a specialized cross validation strategy to select the number of 
support points for row and column latent variables. 

Through simulations, we showed that our composite likelihood methodology has reasonable perfor¬ 
mance in comparison with full likelihood methodology (when the latter is viable) while being much less 
computationally demanding. Our simulations also demonstrated a clear advantage of the row-column 
composite likelihood with respect to the row (only) composite likelihood in terms of estimation efficiency 
- and sometimes also in terms of computing time. 

Importantly, our methodology remains computationally viable even when, due the dimension or 
the structural complexity of the data, the full likelihood cannot be used; this is likely to happen in 
many practical applications - especially in ones involving genomic data, such as the one we presented 
in Section [3 

Another important consequence of the low computational burden of our approach is that repeated 
estimation, such as that required in cross validation, may be run in a reasonable computing time. This 
allowed us to implement model selection using a straightforward cross validation strategy. 

Our first application to genomic data, albeit preliminary, demonstrated the feasibility of using com¬ 
posite likelihood methodology to simultaneously segment long stretches of a genome and cluster large 
arrays of genomic features. For instance, we were able to identify about 50Mb at the beginning of hu¬ 
man chromosome 1 where most of the 28 genomic features we considered tend to be elevated or strongly 
elevated, but three (number of telomerase containing examers, a proxy for repair; DNA transposons, a 
proxy for transposition activity; and histone I13K14 acetylation sites, a proxy for chromatin structure) 
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tend to be strongly depressed. A similar analysis could be extended to all chromosomes and a yet 
broader set of features, unveiling important biological clues. 

Our model and methodology could also be used on many other types of complex genomic data, 
and applied to many other fields. For instance, they could be used for analyzing parallel time-series of 
economic indicators recorded on several countries (see Introduction), or data from Item Response Theory 
(rows corresponding to examinees, columns corresponding to test items administered sequentially). 

Regarding further methodological developments, we plan to explore more sophisticated forms of 
composite likelihood approximation, which may lead to additional improvements in estimation efficiency 
~ e.g. in estimating transition probabilities for the Markov process governing the column latent variable, 
for which estimation quality with our row-column and row composite likelihood appeared poorer than 
for other parameters in simulations. 
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Figure 1: Data on r = 28 features measured in s = 224 contiguous IMh non-overlapping windows 
covering human chromosome 1, after standardization. 
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Figure 2: Color-coded map of predicted genomic feature clusters (right), segmentation states for the 
windows along chromosome 1 (top) and means of each feature in each window (middle) for the se¬ 
lected model (ki = 3 and ^2 = 4j. Rows are rearranged according to the assigned clusters - Cluster 2 
comprises number of telomerase containing examers, DNA transposons and histone H3Klf acetylation 
sites; Cluster 3 comprises number of non-CpC methyl-cytosines, nuclear lamina regions, polymerase II 
binding sites, ALU elements and MER elements (Cluster 1 groups all the remaining 20 features). 


25 



















































































































































































































































































































































































































r 

S 

ki 

fc2 



Full likelihood 

Row 

comp. lik. 

Row-column comp 

. lik. 

Ai 

A2 

As 

Ai 

A2 As 

Ai 

A2 

As 

10 

200 

2 

2 

0.5 

bias 

-0.013 

0.013 


-0.013 

0.013 

-0.013 

0.013 







rmse 

0.157 

0.157 


0.157 

0.157 

0.155 

0.155 


15 

200 

2 

2 

0.5 

bias 

-0.002 

0.002 


-0.002 

0.002 

-0.002 

0.002 







rmse 

0.126 

0.126 


0.126 

0.126 

0.126 

0.126 


10 

400 

2 

2 

0.5 

bias 

0.005 

-0.005 


0.002 

-0.002 

0.003 

-0.003 







rmse 

0.149 

0.149 


0.145 

0.145 

0.145 

0.145 


10 

200 

3 

2 

0.5 

bias 

-0.001 

0.006 

-0.005 

0.001 

0.003 -0.004 

0.002 

0.002 - 

0.003 






rmse 

0.137 

0.140 

0.136 

0.135 

0.140 0.137 

0.134 

0.137 

0.135 

10 

200 

2 

3 

0.5 

bias 

0.009 

-0.009 


0.009 

-0.009 

0.009 

-0.009 







rmse 

0.150 

0.150 


0.150 

0.150 

0.149 

0.149 


10 

200 

2 

2 

1.0 

bias 

-0.007 

0.007 


-0.007 

0.007 

-0.007 

0.007 







rmse 

0.162 

0.162 


0.162 

0.162 

0.161 

0.161 



Table 1; Estimation of the A„ parameters 
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Table 2: Estimation of the Ti^y parameters 
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. Lik. 

u 

V’ui 

V’u2 

V’u3 

V’ui 

V’«2 

V’u3 

V’tii 

i>u2 

^u3 

10 

200 

2 

2 

0.5 

bias 

1 

0.003 

0.001 


-0.002 

0.006 


-0.005 

-0.012 








2 

-0.002 

-0.002 


-0.005 

0.001 


0.010 

0.002 







rmse 

1 

0.071 

0.072 


0.083 

0.084 


0.076 

0.076 








2 

0.073 

0.071 


0.082 

0.083 


0.075 

0.079 


15 

200 

2 

2 

0.5 

bias 

1 

0.002 

0.001 


-0.003 

0.004 


-0.002 

-0.003 








2 

0.000 

-0.001 


-0.004 

0.002 


0.004 

0.002 







rmse 

1 

0.027 

0.028 


0.045 

0.045 


0.035 

0.033 








2 

0.027 

0.029 


0.043 

0.044 


0.031 

0.035 


10 

400 

2 

2 

0.5 

bias 

1 

-0.003 

-0.003 


-0.004 

-0.002 


-0.009 

-0.017 








2 

-0.013 

-0.014 


-0.017 

-0.013 


0.000 

-0.009 







rmse 

1 

0.025 

0.024 


0.034 

0.040 


0.030 

0.034 








2 

0.160 

0.153 


0.170 

0.168 


0.169 

0.163 


10 

200 

3 

2 

0.5 

bias 

1 

0.009 

0.013 


0.005 

0.019 


-0.001 

-0.005 








2 

-0.024 

-0.027 


-0.029 

-0.021 


-0.013 

-0.036 








3 

-0.051 

-0.052 


-0.055 

-0.047 


-0.032 

-0.041 







rmse 

1 

0.150 

0.151 


0.158 

0.159 


0.156 

0.153 








2 

0.304 

0.302 


0.302 

0.303 


0.303 

0.304 








3 

0.319 

0.320 


0.320 

0.321 


0.316 

0.322 


10 

200 

2 

3 

0.5 

bias 

1 

0.007 

0.007 

0.005 

0.008 

0.004 

0.002 

0.005 

-0.002 - 

-0.002 







2 

0.001 

0.002 

-0.001 

0.004 

-0.002 

-0.002 

0.007 

0.012 

0.001 






rmse 

1 

0.144 

0.143 

0.133 

0.157 

0.188 

0.152 

0.145 

0.143 

0.141 







2 

0.047 

0.044 

0.043 

0.081 

0.141 

0.083 

0.051 

0.055 

0.051 

10 

200 

2 

2 

1.0 

bias 

1 

0.003 

0.003 


-0.009 

0.019 


0.011 

-0.033 








2 

0.000 

0.001 


-0.012 

0.017 


0.038 

-0.004 







rmse 

1 

0.102 

0.103 


0.130 

0.132 


0.116 

0.118 








2 

0.083 

0.083 


0.114 

0.119 


0.103 

0.098 



Table 3: Estimation of the ipuv parameters 


r 

S 

ki 

k2 

a" 


Full likelihood 

Row likelihood 

Row-column likelihood 

10 

200 

2 

2 

0.5 

bias 

-0.001 

-0.005 

-0.003 






rmse 

0.016 

0.028 

0.020 

15 

200 

2 

2 

0.5 

bias 

-0.001 

-0.004 

-0.003 






rmse 

0.013 

0.025 

0.016 

10 

400 

2 

2 

0.5 

bias 

0.000 

-0.002 

-0.002 






rmse 

0.012 

0.021 

0.015 

10 

200 

3 

2 

0.5 

bias 

-0.001 

-0.006 

-0.003 






rmse 

0.016 

0.030 

0.022 

10 

200 

2 

3 

0.5 

bias 

-0.001 

0.002 

-0.001 






rmse 

0.017 

0.039 

0.019 

10 

200 

2 

2 

1.0 

bias 

-0.002 

-0.015 

0.006 






rmse 

0.032 

0.057 

0.039 


Table 4: Estimation of 


r 

S 

fci 

k2 


Full likelihood 

Row likelihood 

Row-column likelihood 

10 

200 

2 

2 

0.5 

3.317 

(0.362) 

0.664 

(0.190) 

0.552 

(0.107) 

15 

200 

2 

2 

0.5 

188.944 

(2.761) 

0.784 

(0.219) 

0.654 

(0.139) 

10 

400 

2 

2 

0.5 

5.505 

(0.063) 

0.784 

(0.154) 

0.805 

(0.122) 

10 

200 

3 

2 

0.5 

350.976 

(112.915) 

0.501 

(0.153) 

0.539 

(0.113) 

10 

200 

2 

3 

0.5 

3.441 

(0.672) 

4.734 

(2.649) 

1.558 

(0.589) 

10 

200 

2 

2 

1.0 

3.222 

(0.612) 

1.082 

(0.426) 

0.987 

(0.234) 


Table 5: Median computing time (and median absolute deviation) in seconds. 
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i 

Feature Name 

Description 

1 

GC 

GC content 

2 

CpG 

N. CpG islands 

3 

nCGm 

N. non-CpG methyl-cytosines 

4 

LINE 

N. LINE elements 

5 

SINE 

N. SINE elements 

6 

NLp 

N. nuclear lamina associated regions 

7 

fRec 

Female recombination rates 

8 

mRec 

Male recombination rates 

9 

H3K4mel 

N. H3K4mel sites 

10 

pol2 

N. RNA polymerase-II binding sites 

11 

telomerase_hex 

N. telomerase containing hexamers 

12 

dna.trans 

N. DNA transposons 

13 

X5hMc 

N. 5-hydroxymethylcytosines 

14 

methJevel 

Average value of DNA methylation level 

15 

RepT 

Replication timing in human ES cells 

16 

mir 

N. mammalian interspersed repeat elements (subset of SINEs) 

17 

alu 

N. Alu elements (subset of SINEs) 

18 

mer 

N. mammalian dna transposons (subset of dna.trans) 

19 

11 

N. Ll-elements (subset of LINEs) 

20 

12 

N. L2-elements (subset of LINEs) 

21 

lltarget 

N. LI target sites 

22 

h3kl4ac 

N. Histone H3K14 acetylation sites 

23 

mlRNA 

N. mlRNA sites 

24 

triplex 

N. triplex motifs 

25 

inverted 

N. inverted repeats 

26 

gquadraplex 

N. G-Quadruplex structure forming motifs 

27 

dnasel 

N. dnase-1 hypersensitive sites (from ENGODE. ES cells) 

28 

cExon 

Coverage by coding exons 


Table 6: Features in the genomics data set provided by K.D. Makova and her group at the Pennsylvania 
State University (see also Kuruppumullage Don et al, 2013). 


k2 


ki 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

1 

-8357.2 

-8181.6 

-8141.7 

-8131.2 

-8128.8 

-8127.8 

-8126.7 

-8125.8 

-8125.3 

-8125.2 

-8124.3 

-8123.8 

-8124.0 

-8123.8 

2 

-8342.8 

-8099.8 

-8068.5 

-8046.0 

-8030.2 

-8017.5 

-8008.7 

-8000.5 

-7997.2 

-7992.8 

-7991.3 

-7989.5 

-7988.1 

-7988.1 

3 

-8328.5 

-8096.4 

-8058.2 

-8019.3 

-8000.3 

-7986.5 

-7979.7 

-7970.3 

-7970.5 

-7969.1 

-7968.1 

-7968.9 

-7969.9 

-7969.0 

4 

-8327.7 

-8095.4 

-8053.2 

-8010.4 

-7993.1 

-7984.4 

-7977.0 

-7971.7 

-7969.7 

-7970.8 

-7972.6 

-7973.2 

-7975.1 

-7977.4 

5 

-8327.5 

-8094.8 

-8051.7 

-8005.5 

-7993.9 

-7981.7 

-7975.7 

-7974.0 

-7972.9 

-7974.5 

-7976.9 

-7979.4 

-7983.1 

-7988.1 

6 

-8327.4 

-8094.7 

-8049.7 

-8005.3 

-7989.2 

-7980.9 

-7977.3 

-7974.5 

-7975.5 

-7975.3 

-7979.3 

-7985.8 

-7989.3 

-7995.1 

7 

-8327.0 

-8093.4 

-8050.0 

-8004.2 

-7990.4 

-7982.3 

-7976.3 

-7973.6 

-7978.0 

-7982.0 

-7987.0 

-7993.6 

-7999.8 

-8006.7 

8 

-8326.5 

-8093.4 

-8049.6 

-8002.6 

-7990.6 

-7982.4 

-7977.9 

-7981.2 

-7984.1 

-7986.4 

-7994.3 

-8002.8 

-8009.2 

-8020.5 

9 

-8326.7 

-8093.5 

-8048.1 

-8002.7 

-7990.8 

-7984.5 

-7982.8 

-7982.1 

-7986.0 

-7990.4 

-7998.0 

-8005.4 

-8014.5 

-8030.5 

10 

-8326.0 

-8093.1 

-8048.7 

-8001.1 

-7989.9 

-7986.3 

-7982.1 

-7985.1 

-7992.0 

-7995.6 

-8005.6 

-8012.5 

-8028.7 

-8040.4 


Table 7: Average row-column composite log-likelihood on the validation samples for each ki and ^2 
combination, obtained from D = 100 cross validation replicates. The highest value (highlighted) is 
achieved for ki = 3, k 2 = 12 (for ki = k 2 = I, the average composite log-likelihood is —8887.8j. 











ki 








k2 







2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

1 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

2 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

1 

1 

0 

3 

0 

0 

0 

0 

0 

0 

0 

4 

2 

6 

5 

5 

4 

4 

4 

0 

0 

0 

0 

0 

1 

2 

2 

1 

5 

3 

3 

3 

4 

5 

0 

0 

0 

0 

0 

0 

2 

3 

1 

0 

2 

6 

0 

3 

6 

0 

0 

0 

0 

0 

1 

1 

2 

3 

5 

0 

1 

0 

0 

7 

0 

0 

0 

0 

0 

1 

0 

3 

1 

0 

1 

0 

0 

0 

8 

0 

0 

0 

0 

0 

0 

0 

2 

0 

0 

2 

0 

0 

0 

9 

0 

0 

0 

0 

0 

1 

1 

1 

1 

0 

0 

0 

0 

0 

10 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 


Table 8: Number of times (out of 100) in which the model has the highest cross validation composite 
log-likelihood for each ki and k 2 combination. 


k2 


ki 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

1 

5 

10 

17 

26 

37 

50 

65 

82 

101 

122 

145 

170 

197 

226 

2 

8 

14 

22 

32 

44 

58 

74 

92 

112 

134 

158 

184 

212 

242 

3 

11 

18 

27 

38 

51 

66 

83 

102 

123 

146 

171 

198 

227 

258 

4 

14 

22 

32 

44 

58 

74 

92 

112 

134 

158 

184 

212 

242 

274 

5 

17 

26 

37 

50 

65 

82 

101 

122 

145 

170 

197 

226 

257 

290 

6 

20 

30 

42 

56 

72 

90 

110 

132 

156 

182 

210 

240 

272 

306 

7 

23 

34 

47 

62 

79 

98 

119 

142 

167 

194 

223 

254 

287 

322 

8 

26 

38 

52 

68 

86 

106 

128 

152 

178 

206 

236 

268 

302 

338 

9 

29 

42 

57 

74 

93 

114 

137 

162 

189 

218 

249 

282 

317 

354 

10 

32 

46 

62 

80 

100 

122 

146 

172 

200 

230 

262 

296 

332 

370 


Table 9: Number of free parameters for each ki and k 2 combination. 


k2 


ki 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

1 

0.577 

0.768 

0.811 

0.823 

0.825 

0.826 

0.828 

0.828 

0.829 

0.829 

0.830 

0.831 

0.830 

0.831 

2 

0.593 

0.857 

0.891 

0.915 

0.932 

0.946 

0.956 

0.965 

0.968 

0.973 

0.975 

0.977 

0.978 

0.978 

3 

0.608 

0.860 

0.902 

0.944 

0.965 

0.980 

0.987 

0.998 

0.997 

0.999 

1.000 

0.999 

0.998 

0.999 

4 

0.609 

0.862 

0.907 

0.954 

0.973 

0.982 

0.990 

0.996 

0.998 

0.997 

0.995 

0.994 

0.992 

0.990 

5 

0.609 

0.862 

0.909 

0.959 

0.972 

0.985 

0.992 

0.994 

0.995 

0.993 

0.990 

0.988 

0.984 

0.978 

6 

0.609 

0.862 

0.911 

0.960 

0.977 

0.986 

0.990 

0.993 

0.992 

0.992 

0.988 

0.981 

0.977 

0.971 

7 

0.610 

0.864 

0.911 

0.961 

0.976 

0.984 

0.991 

0.994 

0.989 

0.985 

0.979 

0.972 

0.966 

0.958 

8 

0.610 

0.864 

0.911 

0.962 

0.975 

0.984 

0.989 

0.986 

0.983 

0.980 

0.971 

0.962 

0.955 

0.943 

9 

0.610 

0.864 

0.913 

0.962 

0.975 

0.982 

0.984 

0.985 

0.980 

0.976 

0.967 

0.959 

0.949 

0.932 

10 

0.611 

0.864 

0.912 

0.964 

0.976 

0.980 

0.985 

0.981 

0.974 

0.970 

0.959 

0.952 

0.934 

0.921 


Table 10: Relative performance index for each ki and k 2 combination. Values > 0.9 (highlighted) are 
already achieved using fairly few row and column support points. 


u 


ful 

fu2 


fui 

1 

0.788 

-1.383 

-0.492 

0.134 

0.995 

2 

0.132 

-0.574 

0.832 

0.101 

-1.022 

3 

0.080 

1.015 

-0.954 

0.091 

0.421 


Table 11: Estimates of the mass probabilities of the row latent variable distribution and estimates of the 
means for the selected model (ki = 3 and k 2 = 4j- 
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'^vl 


'^v2 


TTvS 


V pv 


7ry4 


1 

0.121 

0.802 

0.196 

0.002 

0.000 

2 

0.285 

0.085 

0.713 

0.163 

0.039 

3 

0.339 

0.000 

0.171 

0.797 

0.032 

4 

0.255 

0.000 

0.000 

0.086 

0.914 


Table 12: Estimates of the transition probabilities and estimates of the stationary distribution for the 
Markov proeess governing the eolumn latent variable for the selected model (ki = 3 and k 2 = 4:). 


30 



